-
Health Psychology and Behavioral... May 2021
Dependent variables in health psychology are often counts, for example, of a behaviour or number of engagements with an intervention. These counts can be very strongly skewed, and/or contain large numbers of zeros as well as extreme outliers. For example, 'How many cigarettes do you smoke on an average day?' The modal answer may be zero, but answers may range from 0 to 40+. The same can be true for minutes of moderate-to-vigorous physical activity. For some people, this may be near zero, but it can take on extreme values for someone training for a marathon. Typical analytical strategies for these data involve explicit (or implied) transformations (smoker v. non-smoker, log transformations). However, these data types are 'counts' (i.e. non-negative whole numbers) or quasi-counts (time is ratio but discrete minutes of activity could be analysed as a count), and can be modelled using count distributions - including the Poisson and negative binomial distribution (and their zero-inflated and hurdle extensions, which allow even more zeros). In this tutorial paper I demonstrate (in R, Jamovi, and SPSS) the easy application of these models to health psychology data, and their advantages over alternative ways of analysing this type of data using two datasets - one with a highly dispersed dependent variable (number of views on YouTube), and another with a large number of zeros (number of days on which symptoms were reported over a month). The negative binomial distribution had the best fit for the overdispersed number of views on YouTube. Negative binomial and zero-inflated negative binomial were both good fits for the symptom data with over-abundant zeros. In both cases, count distributions provided not just a better fit but would lead to different conclusions compared to the poorly fitting traditional regression/linear models.
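The comparison of count distributions described in this abstract can be illustrated with a minimal sketch. It is not the paper's R, Jamovi, or SPSS workflow: the data below are hypothetical, the negative binomial dispersion is estimated by the method of moments rather than by maximum likelihood, and the names `counts` and `size` are assumptions introduced for illustration.

```python
import math


def poisson_loglik(counts, mu):
    """Log-likelihood of i.i.d. Poisson(mu) counts."""
    return sum(k * math.log(mu) - mu - math.lgamma(k + 1) for k in counts)


def negbin_loglik(counts, mu, size):
    """Log-likelihood of negative binomial counts with mean mu and
    dispersion `size`, so that variance = mu + mu**2 / size."""
    p = size / (size + mu)
    return sum(
        math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
        + size * math.log(p) + k * math.log(1.0 - p)
        for k in counts
    )


def moment_estimates(counts):
    """Sample mean, sample variance, and the moment estimate of `size`."""
    n = len(counts)
    mu = sum(counts) / n
    var = sum((k - mu) ** 2 for k in counts) / (n - 1)
    if var <= mu:
        raise ValueError("no overdispersion: moment estimate of size is undefined")
    return mu, var, mu ** 2 / (var - mu)


# Hypothetical data: many zeros and a few extreme values.
counts = [0, 0, 0, 0, 1, 0, 2, 0, 0, 5, 0, 1, 0, 12, 0, 3, 0, 0, 30, 0]
mu, var, size = moment_estimates(counts)

# AIC = 2 * (number of parameters) - 2 * log-likelihood.
aic_poisson = 2 * 1 - 2 * poisson_loglik(counts, mu)
aic_negbin = 2 * 2 - 2 * negbin_loglik(counts, mu, size)
print(f"mean={mu:.2f} variance={var:.2f} size={size:.3f}")
print(f"AIC Poisson={aic_poisson:.1f}  AIC negative binomial={aic_negbin:.1f}")
```

When the sample variance greatly exceeds the mean, as in this made-up sample, the negative binomial yields the lower AIC. Because the dispersion here is a moment estimate, the negative binomial AIC is only approximate; a maximum-likelihood fit would be needed for a formal comparison.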
PubMed: 34104569
DOI: 10.1080/21642850.2021.1920416
-
Journal of Applied Statistics 2020
Review
In this work, we study a linear birth-death process starting from random initial conditions. First, we consider these initial conditions as a random number of particles following different standard probability distributions: the negative binomial and the closely related geometric, Poisson, or Pólya-Aeppli distributions. It is proved analytically and numerically that in these cases the random number of particles alive at any positive time follows the same probability law as the initial condition, but with different parameters depending on time. The random initial conditions cannot change the critical parameter of the branching mechanism, but they impact the extinction probability. Finally, the numerical model is extended to an application for studying branching processes with more complex initial conditions. This is demonstrated with a linear birth-death process initialised with a Pólya urn sampling scheme. The obtained preliminary results for the particle distribution show a close relation to the Pólya-Aeppli distribution.
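The statement that random initial conditions affect the extinction probability but not the critical parameter can be sketched with the classical closed form for a linear birth-death process. This is a textbook illustration, not the paper's numerical model. The rates, the Poisson mean `theta`, and the geometric parameter `p` are hypothetical, and the geometric law is assumed to live on {1, 2, ...}.

```python
import math


def extinction_prob_single(birth, death, t):
    """P(extinct by time t) for a linear birth-death process started
    from one particle, with per-particle rates `birth` and `death`."""
    if birth == death:
        return birth * t / (1.0 + birth * t)
    growth = math.exp((birth - death) * t)
    return death * (growth - 1.0) / (birth * growth - death)


def poisson_pgf(s, theta):
    """Probability generating function of a Poisson(theta) initial count."""
    return math.exp(-theta * (1.0 - s))


def geometric_pgf(s, p):
    """PGF of a geometric initial count with P(N0 = n) = p * (1 - p)**(n - 1)."""
    return p * s / (1.0 - (1.0 - p) * s)


# Particles evolve independently, so with a random initial count N0 the
# extinction probability is the PGF of N0 evaluated at the single-particle value.
birth, death, t = 1.5, 1.0, 50.0          # hypothetical supercritical rates
q = extinction_prob_single(birth, death, t)
print("single particle:", q)
print("Poisson(3) start:", poisson_pgf(q, 3.0))
print("geometric(0.4) start:", geometric_pgf(q, 0.4))
```

In each case the process is supercritical exactly when `birth > death`, whatever the initial law; only the value of the extinction probability changes.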
PubMed: 35707432
DOI: 10.1080/02664763.2020.1732309
-
PLoS Computational Biology Feb 2024
Outbreaks of emerging and zoonotic infections represent a substantial threat to human health and well-being. These outbreaks tend to be characterised by highly stochastic transmission dynamics with intense variation in transmission potential between cases. The negative binomial distribution is commonly used as a model for transmission in the early stages of an epidemic as it has a natural interpretation as the convolution of a Poisson contact process and a gamma-distributed infectivity. In this study we expand upon the negative binomial model by introducing a beta-Poisson mixture model in which infectious individuals make contacts at the points of a Poisson process and then transmit infection along these contacts with a beta-distributed probability. We show that the negative binomial distribution is a limit case of this model, as is the zero-inflated Poisson distribution obtained by combining a Poisson-distributed contact process with an additional failure probability. We assess the beta-Poisson model's applicability by fitting it to secondary case distributions (the distribution of the number of subsequent cases generated by a single case) estimated from outbreaks covering a range of pathogens and geographical settings. We find that while the beta-Poisson mixture can achieve a closer fit to data than the negative binomial distribution, it is consistently outperformed by the negative binomial in terms of Akaike Information Criterion, making it a suboptimal choice on grounds of parsimony. The beta-Poisson performs similarly to the negative binomial model in its ability to capture features of the secondary case distribution such as overdispersion, prevalence of superspreaders, and the probability of a case generating zero subsequent cases. Despite this possible shortcoming, the beta-Poisson distribution may still be of interest in the context of intervention modelling since its structure allows for the simulation of measures which change contact structures while leaving individual-level infectivity unchanged, and vice versa.
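One feature named in this abstract, the probability that a case generates zero subsequent cases, can be sketched for the three models. The beta-Poisson value below follows from the construction given in the abstract: conditional on a transmission probability p, Poisson thinning gives a Poisson number of secondary cases with mean `lam * p`, so the zero probability is the expectation of exp(-lam * p) over the beta law, evaluated here by a power series. The parameter names and values are hypothetical and are not estimates from the paper.

```python
import math


def poisson_p0(mean):
    """P(0 secondary cases) for a Poisson offspring distribution."""
    return math.exp(-mean)


def negbin_p0(mean, disp):
    """P(0 secondary cases) for a negative binomial offspring distribution
    with the given mean and dispersion parameter."""
    return (1.0 + mean / disp) ** (-disp)


def beta_poisson_p0(lam, a, b, terms=200):
    """P(0 secondary cases) when contacts are Poisson(lam) and each contact
    transmits with probability p ~ Beta(a, b): E[exp(-lam * p)], computed as
    the series sum_n (a)_n / (a + b)_n * (-lam)**n / n!.
    Intended for moderate lam only (the series alternates in sign)."""
    total, term = 0.0, 1.0
    for n in range(terms):
        total += term
        term *= (a + n) / (a + b + n) * (-lam) / (n + 1)
    return total


lam, a, b = 6.0, 0.5, 1.5              # hypothetical contact rate and beta shape
mean = lam * a / (a + b)               # mean number of secondary cases
print("mean secondary cases:", mean)
print("P(0) Poisson:", poisson_p0(mean))
print("P(0) negative binomial, dispersion 0.5:", negbin_p0(mean, 0.5))
print("P(0) beta-Poisson:", beta_poisson_p0(lam, a, b))
```

At equal mean, both overdispersed models put more mass on zero than the Poisson does, which is the pattern associated with superspreading.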
Topics: Humans; Models, Statistical; Computer Simulation; Poisson Distribution; Binomial Distribution; Disease Outbreaks
PubMed: 38330050
DOI: 10.1371/journal.pcbi.1011856
-
Fa Yi Xue Za Zhi Jun 2021
Objective: To derive the probability distribution formula of the combined identity by state (CIBS) score among individuals with different relationships based on population data of autosomal multiallelic genetic markers. Methods: The probabilities of different identity by state (IBS) scores occurring at a single locus between two individuals with different relationships were derived based on the principle of the ITO method. Then the probability distribution formula of the CIBS score between two individuals with different relationships, when a certain number of genetic markers were used for relationship identification, was derived based on multinomial distribution theory. The formula was compared with the CIBS probability distribution formula based on binomial distribution theory. Results: Between individuals with a certain relationship, labelled as RS, the probabilities of IBS=2, 1 and 0 occurring at a certain autosomal genetic marker x can be calculated based on the allele frequency data of that genetic marker and the probabilities that two individuals with the corresponding RS relationship share 0, 1 or 2 identity by descent (IBD) alleles. For a genotyping system with multiple independent genetic markers, the distribution of the CIBS score between pairs of individuals with relationships other than parent-child can be deduced using the averages of these 3 probabilities over all genetic markers, based on multinomial distribution theory. Conclusion: The calculation of the CIBS score distribution formula can be extended to all kinships and has great application value in case interpretation and system effectiveness evaluation. In most situations, the results based on the binomial distribution formula are similar to those based on the formula derived in this study; thus, there is little difference between the two methods in actual work.
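The multinomial construction described in the Results can be sketched as follows, assuming the three per-marker IBS probabilities have already been averaged over markers. The function name, the marker count, and the probability values are hypothetical; the article's own formula and notation are not reproduced here.

```python
from math import comb


def cibs_distribution(n_markers, p2, p1, p0):
    """Distribution of the combined IBS score over `n_markers` independent
    markers, each scoring 2, 1, or 0 with probabilities p2, p1, p0.
    Returns a list `dist` where dist[s] = P(CIBS = s), s = 0 .. 2 * n_markers."""
    if abs(p2 + p1 + p0 - 1.0) > 1e-9:
        raise ValueError("p2 + p1 + p0 must equal 1")
    dist = [0.0] * (2 * n_markers + 1)
    for n2 in range(n_markers + 1):              # markers with IBS = 2
        for n1 in range(n_markers - n2 + 1):     # markers with IBS = 1
            n0 = n_markers - n2 - n1             # markers with IBS = 0
            ways = comb(n_markers, n2) * comb(n_markers - n2, n1)
            dist[2 * n2 + n1] += ways * p2 ** n2 * p1 ** n1 * p0 ** n0
    return dist


# Hypothetical averaged probabilities for one relationship and 20 markers.
n_markers, p2, p1, p0 = 20, 0.35, 0.50, 0.15
dist = cibs_distribution(n_markers, p2, p1, p0)
mean_score = sum(s * prob for s, prob in enumerate(dist))
print("P(CIBS >= 30):", sum(dist[30:]))
print("mean CIBS:", mean_score)
```

Tail probabilities such as `sum(dist[30:])` are the kind of quantity that could be compared across relationships when evaluating a marker system.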
Topics: Alleles; Gene Frequency; Genetic Markers; Genotype; Humans; Probability
PubMed: 34379907
DOI: 10.12116/j.issn.1004-5619.2020.500311
-
BMC Bioinformatics May 2023
BACKGROUND
The spectrum of mutations in a collection of cancer genomes can be described by a mixture of a few mutational signatures. The mutational signatures can be found using non-negative matrix factorization (NMF). To extract the mutational signatures we have to assume a distribution for the observed mutational counts and a number of mutational signatures. In most applications, the mutational counts are assumed to be Poisson distributed, and the rank is chosen by comparing the fit of several models with the same underlying distribution and different values for the rank using classical model selection procedures. However, the counts are often overdispersed, and thus the Negative Binomial distribution is more appropriate.
RESULTS
We propose a Negative Binomial NMF with a patient-specific dispersion parameter to capture the variation across patients and derive the corresponding update rules for parameter estimation. We also introduce a novel model selection procedure inspired by cross-validation to determine the number of signatures. Using simulations, we study the influence of the distributional assumption on our method together with other classical model selection procedures. We also present a simulation study with a method comparison where we show that state-of-the-art methods substantially overestimate the number of signatures when overdispersion is present. We apply our proposed analysis to a wide range of simulated data and to two real data sets from breast and prostate cancer patients. On the real data we describe a residual analysis to investigate and validate the model choice.
CONCLUSIONS
With our results on simulated and real data we show that our model selection procedure is more robust at determining the correct number of signatures under model misspecification. We also show that our model selection procedure is more accurate than the available methods in the literature for finding the true number of signatures. Lastly, the residual analysis clearly emphasizes the overdispersion in the mutational count data. The code for our model selection procedure and Negative Binomial NMF is available in the R package SigMoS and can be found at https://github.com/MartaPelizzola/SigMoS .
Topics: Male; Humans; Mutation; Algorithms; Binomial Distribution; Breast; Computer Simulation
PubMed: 37158829
DOI: 10.1186/s12859-023-05304-1
-
Journal of Applied Statistics 2021
Control charts are widely used for monitoring quality characteristics of high-yield processes. In such processes, where a large number of zero observations exists in count data, zero-inflated binomial (ZIB) models are more appropriate than ordinary binomial models. In ZIB models, random shocks occur with a given probability, and upon the occurrence of a random shock, the number of non-conforming items in a sample of fixed size follows the binomial distribution with a given non-conforming proportion. In the present article, we study in more detail the exponentially weighted moving average control chart based on the ZIB distribution (ZIB-EWMA) and we also propose a new control chart based on the double exponentially weighted moving average statistic for monitoring ZIB data (ZIB-DEWMA). The two control charts are studied in detecting upward shifts in the shock probability or the non-conforming proportion individually, as well as in both parameters simultaneously. Through a simulation study, we compare the performance of the proposed chart with the ZIB-Shewhart, ZIB-EWMA and ZIB-CUSUM charts. Finally, an illustrative example is also presented to display the practical application of the ZIB charts.
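The ZIB model and an EWMA statistic of the kind described above can be sketched as follows. The symbols `phi` (shock probability), `n` (sample size), and `p` (non-conforming proportion) are assumed names, and the smoothing constant, limit width, and observations are hypothetical. The upper limit uses the textbook asymptotic EWMA variance and is not necessarily the chart design studied in the article.

```python
from math import comb, sqrt


def zib_pmf(x, phi, n, p):
    """P(X = x): a shock occurs with probability phi; given a shock,
    X ~ Binomial(n, p); with no shock, X = 0."""
    binom = comb(n, x) * p ** x * (1.0 - p) ** (n - x)
    return (1.0 - phi) * (x == 0) + phi * binom


def zib_mean_var(phi, n, p):
    """Mean and variance of the ZIB distribution."""
    mean = phi * n * p
    var = phi * n * p * (1.0 - p) + phi * (1.0 - phi) * (n * p) ** 2
    return mean, var


def ewma(observations, lam, start):
    """EWMA path z_t = lam * x_t + (1 - lam) * z_{t-1}, with z_0 = start."""
    z, path = start, []
    for x in observations:
        z = lam * x + (1.0 - lam) * z
        path.append(z)
    return path


phi, n, p = 0.1, 50, 0.02            # hypothetical in-control parameters
lam, width = 0.2, 3.0                # hypothetical smoothing constant and limit width
mean, var = zib_mean_var(phi, n, p)
ucl = mean + width * sqrt(var * lam / (2.0 - lam))   # asymptotic upper limit

observations = [0, 0, 0, 1, 0, 0, 2, 3, 1, 2]        # hypothetical sample counts
path = ewma(observations, lam, start=mean)
first_signal = next((t for t, z in enumerate(path) if z > ucl), None)
print("UCL:", ucl)
print("first signalling sample (0-based index):", first_signal)
```

Only an upper limit is used because the abstract concerns upward shifts.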
PubMed: 35706893
DOI: 10.1080/02664763.2020.1761950
-
Psychometrika Sep 2020
Multi-layer networks arise when more than one type of relation is observed on a common set of actors. Modeling such networks within the exponential-family random graph (ERG) framework has been previously limited to special cases and, in particular, to dependence arising from just two layers. Extensions to ERGMs are introduced to address these limitations: Conway-Maxwell-Binomial distribution to model the marginal dependence among multiple layers; a "layer logic" language to translate familiar ERGM effects to substantively meaningful interactions of observed layers; and nondegenerate triadic and degree effects. The developments are demonstrated on two previously published datasets.
Topics: Language; Models, Statistical; Psychometrics
PubMed: 33025459
DOI: 10.1007/s11336-020-09720-7
-
Statistical Methods in Medical Research Jul 2023
The zero-inflated negative binomial distribution has been widely used for count data analyses in various biomedical settings due to its capacity to model excess zeros and overdispersion. When there are correlated count variables, a bivariate model is essential for understanding their full distributional features. Examples include measuring the correlation of two genes in sparse single-cell RNA sequencing (scRNA-seq) data and modeling dental caries count indices on two different tooth surface types. For these purposes, we develop a richly parametrized bivariate zero-inflated negative binomial model that has a simple latent variable framework and eight free parameters with intuitive interpretations. In the scRNA-seq data example, the correlation is estimated after adjusting for the effects of dropout events represented by excess zeros. In the dental caries data, we analyze how the treatment with Xylitol lozenges affects the marginal mean and other patterns of response manifested in the two dental caries traits. An R package "bzinb" is available on the Comprehensive R Archive Network.
Topics: Humans; Dental Caries; Models, Statistical; Binomial Distribution; Data Analysis; Poisson Distribution
PubMed: 37167422
DOI: 10.1177/09622802231172028
-
Statistical Methods in Medical Research Mar 2023
Changes in cognitive function over time are of interest in ageing research. A joint model is constructed to investigate these changes. Generally, cognitive function is measured through more than one test, and the test scores are integers. The aim is to investigate two test scores and use an extension of a bivariate binomial distribution to define a new joint model. This bivariate distribution models the correlation between the two test scores. To deal with attrition due to death, the Weibull hazard model and the Gompertz hazard model are used. A shared random-effects model is constructed, and the random effects are assumed to follow a bivariate normal distribution. It is shown how to incorporate random effects that link the bivariate longitudinal model and the survival model. The joint model is applied to the English Longitudinal Study of Ageing data.
Topics: Longitudinal Studies; Proportional Hazards Models; Binomial Distribution; Cognition; Models, Statistical
PubMed: 36573012
DOI: 10.1177/09622802221146307
-
MedRxiv : the Preprint Server For... Nov 2020
The number of secondary cases is an important parameter for the control of infectious diseases. When individual variation in disease transmission is present, as for COVID-19, the number of secondary cases is often modelled using a negative binomial distribution. However, this may not be the best distribution to describe the underlying transmission process. We propose the use of three other offspring distributions to quantify heterogeneity in transmission, and we assess the possible bias in estimates of the offspring mean and its overdispersion when the data-generating distribution is different from the one used for inference. We find that overdispersion estimates may be biased when there is a substantial amount of heterogeneity, and that the use of other distributions besides the negative binomial should be considered. We revisit three previously analysed COVID-19 datasets and quantify the proportion of cases responsible for 80% of transmission, while acknowledging the variation arising from the assumed offspring distribution. We find that the number of secondary cases for these datasets is better described by a Poisson-lognormal distribution.
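The proportion of cases responsible for 80% of transmission can be sketched for any offspring distribution given as a probability mass function. The definition below is one simple discrete version (rank cases from most to least infectious and accumulate until 80% of expected secondary cases is reached); it is an assumption and may differ from the estimator used in the preprint. The mean, dispersion, and truncation point are hypothetical.

```python
import math


def poisson_pmf(k, mean):
    return math.exp(k * math.log(mean) - mean - math.lgamma(k + 1))


def negbin_pmf(k, mean, disp):
    """Negative binomial pmf with the given mean and dispersion parameter."""
    p = disp / (disp + mean)
    return math.exp(
        math.lgamma(k + disp) - math.lgamma(disp) - math.lgamma(k + 1)
        + disp * math.log(p) + k * math.log(1.0 - p)
    )


def proportion_for_share(pmf, share=0.8):
    """Smallest proportion of cases, taken from the most infectious downwards,
    that accounts for `share` of expected secondary cases.
    `pmf[k]` is P(a case generates k secondary cases)."""
    target = share * sum(k * prob for k, prob in enumerate(pmf))
    cases, transmitted = 0.0, 0.0
    for k in range(len(pmf) - 1, 0, -1):
        if transmitted + k * pmf[k] >= target:
            # Only part of this class is needed to reach the target.
            return cases + (target - transmitted) / k
        transmitted += k * pmf[k]
        cases += pmf[k]
    return cases


mean, disp, k_max = 2.5, 0.16, 1000      # hypothetical values; pmf truncated at k_max
nb = [negbin_pmf(k, mean, disp) for k in range(k_max + 1)]
pois = [poisson_pmf(k, mean) for k in range(k_max + 1)]
prop_nb = proportion_for_share(nb)
prop_pois = proportion_for_share(pois)
print("negative binomial:", prop_nb)
print("Poisson:", prop_pois)
```

The same function can be applied to any other offspring law once its pmf has been tabulated, which is how the dependence of this proportion on the assumed distribution could be explored.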
PubMed: 34013290
DOI: 10.1101/2020.11.27.20239657